Testing Hypotheses with Induction Trees
Abstract
Data mining has brought a new philosophy to data analysis, one driven primarily by computational efficiency and predictive performance. In this paper we attempt to show that, while this philosophy opens new horizons, the new techniques may themselves gain much when coupled with traditional statistical reasoning. For example, I recently introduced some sociologists at the University of Geneva to induction trees. These social scientists were impressed by the ease with which such tools allowed them to extract valuable knowledge from their datasets. However, since they were used to fitting statistical models such as linear or logistic regressions, or even multinomial log-linear models, they naturally wanted to know how well the induced trees fit their data. They also wanted to test the significance of specific branch expansions and compare them with alternatives that they found more meaningful. The classification error rates left them unsatisfied. Being primarily interested in how the predictors jointly affect the distribution of the response variable rather than in classification, they expected divergence chi-square statistics and inferential tools for comparing alternative structures. Unfortunately, they did not find such information in the software output.
It is indeed characteristic of data mining, and especially of machine learning, to focus on the usefulness and predictive performance of the induced rules and to somewhat neglect their descriptive content. Thus, the rules are most often used as black boxes. As our sociologists discovered, however, the rules also provide very useful descriptive knowledge about the phenomenon under study. It therefore makes sense to statistically validate the description provided. Little attention has been given to this aspect so far. Textbooks such as Han and Kamber (2001) do not mention it, and, as far as prediction rules are concerned, statistical learning (see Hastie et al., 2001, chap. 7) concentrates on the statistical properties of the classification error rate.
This lack of inferential tools for the descriptive content of classification rules motivated this paper. Focusing on induction trees with categorical variables, we propose a simple trick that makes it possible to apply to them the inferential tools used, for instance, in the statistical log-linear modeling of multinomial cross tables. The paper is organized as follows. Section 2 discusses the fit issue and introduces the trick that renders induced trees conformable with the requirements of chi-square statistics. Section 4 is devoted to tree comparison and shows how tests of hypotheses about the tree structure can be carried out with the deviance statistic. Section 5 provides concluding remarks.
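To make the idea of a divergence chi-square statistic for a tree more concrete, here is a minimal sketch. It is not the procedure developed in the paper. It only assumes that a tree with categorical variables can be summarized by a table of counts with one row per response category and one column per leaf, and it computes the standard likelihood-ratio statistic G² of that table against independence, which can be read as the deviance reduction of the tree relative to the root node alone. The function name `deviance_gain` and the counts are hypothetical.

```python
import math


def deviance_gain(table):
    """Likelihood-ratio chi-square (G^2) of a response-by-leaf count table.

    `table` is a list of rows, one row per response category and one
    column per leaf (an assumed layout, chosen for illustration).
    Returns the statistic and its degrees of freedom.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)

    g2 = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            if observed > 0:  # empty cells contribute nothing to G^2
                expected = row_totals[i] * col_totals[j] / n
                g2 += 2.0 * observed * math.log(observed / expected)

    df = (len(table) - 1) * (len(table[0]) - 1)
    return g2, df


# Made-up counts: 2 response categories (rows) by 3 leaves (columns).
counts = [
    [30, 10, 20],
    [10, 30, 20],
]
g2, df = deviance_gain(counts)
print(f"G2 = {g2:.2f} on {df} degrees of freedom")
```

Under the usual asymptotic conditions, the statistic would be compared with a chi-square distribution with the returned degrees of freedom. Applying the same computation to the tables of two nested trees and taking the difference gives a rough analogue of a test on a specific branch expansion.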
Similar resources
Overpruning Large Decision Trees
This paper presents empirical evidence for five hypotheses about learning from large noisy domains: that trees built from very large training sets are larger and more accurate than trees built from even large subsets; that this increased accuracy is only in part due to the extra size of the trees; and that the extra training instances allow both better choices of attribute while building the tr...
Testing fuzzy hypotheses with vague data
The problem of testing fuzzy hypotheses in the presence of vague data is considered. A new method based on the necessity index of strict dominance (NSD) is suggested. An example of how to apply the proposed test in statistical quality control is shown.
Fuzzy decision making in testing hypotheses: An introduction to the packages "FPV" and "Fuzzy.p.value" with practical examples
This paper reviews and compares two R packages, "FPV" and "Fuzzy.p.value". These packages are designed for testing hypotheses in a fuzzy environment using a fuzzy p-value based approach. In fact, the packages "FPV" and "Fuzzy.p.value" propose some useful functions for testing hypotheses when the data / hypotheses are fuzzy rather than crisp. The proposed methods and function...
TESTING STATISTICAL HYPOTHESES UNDER FUZZY DATA AND BASED ON A NEW SIGNED DISTANCE
This paper deals with the problem of testing statistical hypotheses when the available data are fuzzy. In this approach, we first obtain a fuzzy test statistic based on fuzzy data, and then, based on a new signed distance between fuzzy numbers, we introduce a new decision rule to accept/reject the hypothesis of interest. The proposed approach is investigated for two cases: the case without nuisance p...
Fuzzy decision in testing hypotheses by fuzzy data: Two case studies
In testing hypotheses, we may be confronted with cases where data are recorded as non-precise (fuzzy) rather than crisp. In such situations, the classical methods of testing hypotheses are not adequate and need to be generalized. In solving the problem of testing hypotheses based on fuzzy data, the fuzziness of the observed data leads to the fuzzy p-value. This paper has been focused to calculate fuz...
CN2-MCI: A Two-Step Method for Constructive Induction
Methods for constructive induction perform an automatic transformation of description spaces if representational shortcomings deteriorate the quality of learning. In the context of concept learning and propositional representation languages, feature construction algorithms have been developed in order to improve the accuracy and to decrease the complexity of hypotheses. Particularly, so-called ...
Publication date: 1999